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Abstract 

High-dimensional gene expression data provide a rich source of information because they capture the expression 
level of genes in dynamic states that reflect the biological functioning of a cell. For this reason, such data are suitable 
to reveal systems related properties inside a cell, e.g., in order to elucidate molecular mechanisms of complex diseases 
like breast or prostate cancer. However, this is not only strongly dependent on the sample size and the correlation 
structure of a data set, but also on the statistical hypotheses tested. Many different approaches have been developed 
over the years to analyze gene expression data to (I) identify changes in single genes, (II) identify changes in gene sets 
or pathways, and (III) identify changes in the correlation structure in pathways. In this paper, we review statistical 
methods for all three types of approaches, including subtypes, in the context of cancer data and provide links to 
software implementations and tools and address also the general problem of multiple hypotheses testing. Further, we 
provide recommendations for the selection of such analysis methods. 

Reviewers: This article was reviewed by Arcady Mushegian, Byung-Soo Kim and Joel Bader. 
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Review 

Background 

The early driving forces in biology were reductionist 
approaches. In general, a reductionist approach tries to 
break-down a complex system into its parts list and 
explains its properties as the sum of its individual com- 
ponents. Hence, the individual constituents of a system 
inform its higher level functions [1-4]. However, the one 
gene, one protein, one function working hypothesis [5] 
is not sufficient in order to explain the many emergent 
properties such as the phenotypic variability of organ- 
isms or the heterogeneity of cancer [6]. For this reason, 
nowadays, it is generally acknowledged that for achiev- 
ing a functional understanding of biological systems, the 
genes in a cell need to be studied as a functioning collec- 
tive [2,3,7]. In such a system, the collective functioning of 
groups of genes results in, for instance, signaling pathways 
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or protein complexes that regulate cell differentiation, 
transcription regulation or growth. 

A systems integration at the cellular level has the poten- 
tial to answer many, until now, unsolved questions about 
biological systems and their collective functioning, reg- 
ulatory programs for growth, development, phenotypic 
variability and the causality of many complex diseases 
[8-10]. Due to the enormous complexity of a cellular sys- 
tem, where many processes and interactions at different 
levels inside a cell work in harmony to assure the vital 
functioning of a cell, we need to understand key proper- 
ties of biological systems like its robustness or modularity 
[2,8] in order to enhance our understanding of complex 
diseases. These complex interactions occurring within 
a cell can be described by networks [11-13], including 
gene regulatory networks [14,15], protein-protein inter- 
action (PPI) networks [16,17], metabolic networks [18] 
and transcription regulatory networks [19,20]. The net- 
works are organized at different cellular levels and enable 
the functionality of the cell. The question now arising is 
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how can the complexity inside a cell be understood, and 
analyzed? 

The development of information processing technolo- 
gies in the post genomic era enabled the generation of 
huge amounts of data. In this review, we focus on gene 
expression data from microarray platforms and summa- 
rize three major types of analysis strategies: (I) Identi- 
fication of changes in single genes, (II) identification of 
changes in gene sets or pathways, and (III) identification of 
changes in the correlation structure within pathways. We 
discuss these methods in the context of cancer data sets 
to emphasize their biological meaning, implications and 
expressiveness. 

Large-scale gene expression data 

In the next section, we briefly review high-throughput 
technologies that enable the generation of large-scale gene 
expression data [21-23]. 

Gene expression data from microarray 

A microarray experiment measures genome-wide gene 
expression levels of mRNA in a cell or a tissue sample 
under a particular condition. A microarray chip quantifies 
the hybridization of fluorecsent labeled target nucleotide 
sequences to defined complementary probe sequences 
that are spotted on a glass or silicon slide. For differ- 
ent microarray platforms the spotted probes are synthetic 
oligonucleotides ranging from 25 to 80 nucleotides or long 
cDNA transcripts. Different microarray platforms were 
designed for a single-channel or a multi-channel experi- 
mental setting. For single-channel arrays each condition 
sample is hybridized separately on individual arrays using 
a single dye. For multi-channel arrays multiple conditions 
are hybridized together on individual arrays using mul- 
tiple dyes. For example Affymetrix is a single-channel 
platform, where multiple oligonucleotide probes (probe- 
set) of 25 bases are used to measure the concentration of a 
mRNA transcript. The target mRNAs of expressed genes 
are extracted from a treatment or a control sample, reverse 
transcribed to cDNAs, labeled with a fluorescent dye and 
then hybridized to a microarray. An image of the microar- 
ray captures laser induced emitted fluoresent intensities of 
the probes at each spot. The intensities give a proportional 
measure of the corresponding mRNA concentration for 
each gene that was denned on the microarray. 

Gene expression data from next generation sequencing 
(RNA-seq) 

The transcriptome of a cell comprises mRNA, tRNA, 
rRNA, and short regulatory RNAs. RNA-seq is a tran- 
scriptome sequencing approach that uses deep sequenc- 
ing techniques such as 454 (Roche), genome analyzer 
(Illumina solexa), SOLiD (support oligonucleotide lig- 
ation detection), Polonator G.007, HeliScope (Helicos 



Biosciences) and SMRT (single molecule real time 
sequencing) [24]. 

RNA-seq has a wide variety of applications such as the 
measurement of gene expression levels from transcribed 
mRNA sequences [25]. In the first step of the procedure 
RNA is extracted from a given condition sample, frag- 
mented, reverse transcribed to cDNA that is ligated to 
adapters. In the second step a library of reads is generated 
from the ligated fragments that are sequenced. In the third 
step the reads are mapped to known exon sequences of 
genes. The expression level of a gene is measured from the 
normalized number of mapped sequences that mapped to 
the known set of exon sequences of a gene. The RNA-seq 
transcriptome sequencing approach overcomes several 
limitations of microarrays for measuring gene expression. 
For example, RNA-seq measures large ranges of expres- 
sion levels from very low to highly expressed genes and 
is able to consider unknown transcribed sequences. Since 
the novelty of the methodology, gold standard proce- 
dures for the management and processing of the data are 
currently being established. 

Gene expression data and cancer 

Cancer is a multifactorial disease, i.e., the detection of 
one mutation in one gene cannot explain the phenotypic 
plurality of carcino- and pathogenesis by a one-to-one 
relationship between genotype and phenotype. Instead, 
cancer can be induced by a multitude of genetic and envi- 
ronmental factors and the accumulation of such events. 
The intervening of such complex factors makes in gen- 
eral the characterization of complex diseases difficult. For 
this reason it is astonishing that the seminal work by 
Weinberg et al. [6,26] presented a relative simple, system- 
atic functional framework for cancer and the role different 
biological key processes are playing. In this paper, the so 
called hallmarks of cancer have been defined. According 
to [6], the hallmarks of cancer (see Figure 1) are: 

• self sufficiency in growth signals 

• insensitivity to anti-growth signals 

• evading apoptosis 

• limitless replicative potential 

• sustained angiogenesis 

• tissue invasion and metastasis 

Later, this list has been extended by adding two further 
hallmarks [26] 

• deregulating cellular energetics 

• avoiding immune destruction 

and also two enabling characteristics 

• genome instability 

• mutation and tumor-promoting inflammation 
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It has been recognized that these hallmarks are gradu- 
ally acquired by different types of cancers, potentially, in 
a variable order. This variability in the acquiring of these 
disease-bearing processes is one of the indicators of the 
complexity of cancer. 

The biological processes in a cell are controlled and reg- 
ulated by signaling pathways that are activated by internal 
and external signaling receptors and factors. The sig- 
naling pathways governing growth and cell proliferation 
are likely dysregulated in their functioning in cancer. For 
example, they become insensitive to anti-growth signals, 
or they are dys-regulated in growth signaling pathways 
by gaining autonomy in their growth. It is assumed that 
interaction changes at various levels (genetic, mRNA or 
protein) lead to the unlimited growth of cells instead of the 
up-regulation or down-regulation of a single gene. Fur- 
ther, sometimes, even a moderate change in the expres- 
sion of a group of genes can lead to a significant change in 
the biological function of an organism [27] . 

Currently, the underlying processes that contribute to 
cancer are being intensively investigated. However, so far, 
the molecular causes that initiate and maintain cancer 
are not well understood. For this reason, the understand- 
ing of gene expression profiles, which provide signatures 
of all the active genes and their interconnections in a 
cell, contain valuable information about the functioning 
of key pathways, as expressed by the hallmarks of cancer 



and, hence, enable a practical investigation of functional 
mechanisms thereof [28-36]. Despite the different focus of 
many studies of different cancer types, common themes 
in the form of 'key pathways' can be found throughout. 
For instance, the NF-/cB pathway involved in the cellu- 
lar responses to external stimuli like cytokines or free 
radicals, and immune response to infection [29,37-39]; 
the MAPK signaling pathway responsible for regulating 
growth factor signaling including the RAF, MEK, and 
MAPK cascade [34,39,40]; the p53 signaling pathways 
involved in DNA damage control, apoptosis and inhi- 
bition of angiogenesis [37,41,42]; or the Wnt signaling 
pathway involved in cell differentiation, and cell polarity 
[31,34,36,38]. 

Formulating biological hypotheses 

A main goal of high- throughput gene expression anal- 
ysis is to identify differentially expressed genes or gene 
sets between two or more conditions to enable a func- 
tional interpretation of the underlying condition-specific 
mechanisms. The biological processes at the gene level are 
complex in nature as they dynamically interact with each 
other. A single gene can participate in different biologi- 
cal processes and regulate different genes at different time 
points. The identification of key genes or pathways is a 
difficult task, because their interactions are unknown. We 
only observe the phenotypic outcome of test conditions 



Emmert-Streib etal. Biology Direct 201 2, 7:44 
http://www.biology-direct.eom/content/7/l/44 



Page 4 of 25 



and the corresponding gene expression patterns measured 
from a tissue or cell culture. Univariate and multivariate 
statistical methods can be applied in order to understand 
such differences from a statistical perspective. The first 
type of approach that has been used to identify changes 
in the gene expression is a differential gene expression 
analysis. This approach is commonly used to compare 
different conditions of microarray samples to identify dif- 
ferences between them. As a result, a single gene analysis 
approach gives a list of genes that show a statistically 
significant difference between two conditions. For can- 
cer, such genes may correspond to oncogenes or tumor 
suppressor genes. 

If we consider the underlying network where differ- 
ent biological functions are being described by groups of 
interacting genes, a single gene analysis does not resolve 
the biological functions that are affected primarily in dis- 
ease conditions and are causal factors of the disease. In 
order to get a systematic understanding of the disease 
or phenotypes we have to first understand what biolog- 
ical functions contribute to these changes, and perform 
a comparison between conditions using groups of genes 
defined by biological pathways. This approach leads to 



comparing gene expression data at the pathway level 
where sets of genes are tested for differential expression. 

Another interesting property that can be extracted 
from gene expression data is the correlation structure of 
gene expression profiles between all genes. This correla- 
tion structure shows associations between genes which 
directly or indirectly interact with each other [8,43-45]. 
Comparative analyses of gene pathways that consider the 
correlation structure of expression data can provide a suit- 
able test for the hypothesis of changes in the underlying 
network. 

In summary a gene expression data set can be used to 
(I) identify differentially altered single genes, (II) identify 
differentially expressed gene sets or pathways, and (III) 
identify differentially correlated pathways. In the follow- 
ing sections, we review statistical methods that have been 
introduced to study the three problems (l-lll) above. In 
Figure 2 we give a graphical overview of the such methods. 

Before we procede, we would like to point out that all 
of these methods test statistical hypotheses [46]. That 
implies that in order to understand a particular method 
biologically, i.e., one is capable of providing a biological 
interpretation, one needs to understand the underlying 
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Figure 2 Classification of different univariate and multivariate hypothesis testing based methods for gene expression data analysis. The 

data are used to test differential expression and co-expression between two conditions for single gene and gene set based approaches. 
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null hypothesis. In our opinion, it is helpful to approx- 
imately categorize all statistical hypotheses into three 
categories with respect to their biological interpretability, 
whereas each category represents a different degree of dif- 
ficulty to find a biological interpretation for a hypothesis. 
In the following, we provide a brief discussion of these 
three categories because it enables a better, potentially, 
more plausible understanding of the methods presented 
in the next sections. 

In category one belong all hypotheses for which it is 
relatively easy to find a meaningful biological interpreta- 
tion. An example from this category are tests that compare 
mean values (/x), e.g., to identify the differential expres- 
sion of genes (section 'Differential expression of a gene'). 
That means these tests use the mean as a test statistic. 
Due to the fact that the underlying (probability) distribu- 
tion of the genes represents, biologically, the activity of the 
gene expressions, the interpretation of a null hypothesis 
is directly derived thereof. For this reason the biological 
interpretation of the rejection of the null hypothesis given 
in Eqn. 2, is intuitively clear and appealing, because it 
implies a change in the (mean) expression of genes which 
may indicate a change in a biological function because the 
number of available proteins may be altered. 

In category two fall tests for which there are sev- 
eral alternative biological interpretations. This makes the 
interpretation of such tests ambivalent from a biological 
perspective. As example for such a test, we consider the 
detection of the differential variance of a gene (section 
'Differential variance of a gene'). Despite the fact that the 
underlying probability distribution of the expression of 
genes has a clear biological interpretation, the biologi- 
cal interpretation for the rejection of the null hypothesis 
in Eqn. 4 is not unique. For instance, a gene could have 
a different variance in two conditions because, e.g., in 
condition one it is periodically expressed, whereas in con- 
dition two it is constantly expressed on an intermediate 
level. The former condition may be related to the cell cycle 
or the circadian rhythm, or periodically triggered by an 
external signaling factor that is released by the adminis- 
tration of a medication that is regularly taken. A second 
equally plausible interpretation could be that in one con- 
dition the cell utilizes parallel pathways to transfer a signal 
whereas in the other condition only one signaling chain is 
used. The reason for the utilization of parallel pathways 
could be triggered by stress factors, e.g., in the presence of 
an infection, so that the cell is running' full power in order 
to execute all necessary programs that have been initiated 
by the presence of the intruder. 

Lastly, for tests in category three it is very difficult to 
find sufficiently precise biological interpretations because, 
statistically, these methods test complex' expressions. An 
example for a test from this category is the N-statistic 
(section 'N-statistic'). The null hypothesis is based on 



the comparison of two distributions rather than scalar 
test statistics. In order to clarify the crucial difference 
between the comparison of two distributions and scalar 
test statistics we note that, theoretically, every probabil- 
ity distribution can be written as a series expansion in its 
moments [47] . This means a test for a distribution, com- 
pares implicitly the moments of this distribution. Here the 
(k-th) moment is defined as the expectation value of a 
random variable (to power k), i.e., 

m k = nx k ]. (1) 

An example for a moment is the mean (which is the first 
moment), other examples of entities that can be expressed 
as a function of moments are the variance and the kurto- 
sis. This means whenever the null hypothesis in Eqn. 39 is 
rejected it could be because of a difference in any moment 
of which there are, theoretically, infinite many. Put dif- 
ferently, this kind of unspecificity makes this test very 
powerful in the sense that it may detect any possible differ- 
ence two distributions can exhibit. On the other hand, if 
the null hypothesis is rejected it is very difficult to identify 
a precise reason for its rejection. For instance, this could 
be related to a difference in the mean, variance, kurtosis 
or any higher moment or function thereof. These com- 
binatorial factors do usually not allow to find a concise 
biological interpretation. Nevertheless, such a test can be 
of valuable use, e.g., for diagnostic purposes. 

Single-gene analysis 

Single-gene based methods can be subdivided into three 
major classes. A) Methods for detecting differential gene 
expression, B) methods for detecting differential correla- 
tion, and C) methods for detecting a differential variance. 

Differential expression of a gene 

The analysis of differential gene expression is based on the 
mean expression change of individual genes. Suppose for a 
gene gi in a microarray data set the mean expression value 
for two conditions are \i\ and respectively. Then the 
null hypothesis for the differential expression of the gene 
gi is denned as 

Ho : fix = Ii2 (2) 

Hi: /xi ^ fi 2 (3) 

A gene is called differentially expressed when Hq is 
rejected. Figure 3 A shows an example where the samples 
for two conditions are drawn from two normal distribu- 
tions with different mean values, i.e., N(fii = 0, o\ = 1) 
and A/X/X2 = l,a 2 = 1). 

The first published studies for gene expression analysis 
selected differentially expressed genes based on a fold- 
change criteria between a treatment and control condition 
[48] . For example, an early application of this measure was 
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Figure 3 Examples for different biological hypothesis that can be formulated based on different properties of gene expression data 
(green - control group (condition 1), red - treatment group (condition 2)). E: £ corresponds to the covariance matrix, corresponding to a 
constant but different matrix for the two conditions. 
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used to compare normal colon epithelium and primary 
colon cancers [49] . Since then, many statistical approaches 
have been developed to provide more robust measures. 
Among the most popular methods are, e.g., SAM [50,51], 
limma [52], and the empirical Bayes approach from Efron 
etal. [53]. 

Differential variance of a gene 

The analysis of differential variability (DV) aims to detect 
a change in the variance of the gene expression values [54]. 
Suppose in a microarray expression data set, the mean 
expression value of a gene i is /jl c and its variance a c 2 , 
for condition c = {1,2}. Then, the null and alternative 
hypothesis tested are: 



Ho : o l = a 2 
Hi : a\ ^ a\ 



(4) 
(5) 



A gene is called differentially variable when the null 
hypothesis Hq is rejected. The DV analysis in [54] tests 
Ho by using a F-test. In Figure 3B we show an example 
for a gene with a constant mean, but a changed variance 
in the two conditions. The samples for the two conditions 
are drawn from a standard normal distribution with the 
same mean but different variances for the conditions, i.e., 
AfOi = 0,ai = 1), and N(fi 2 = 0,a 2 = 2). 

Differential correlation of a gene 

The analysis of differential correlation aims to detect 
changes in the dependency structure of a single gene [55]. 
Suppose r; = rn,...,rip denotes a p — 1 dimensional cor- 
relation vector, whereas each component corresponds to 
the correlation between gene i and one of the other p — 1 
genes in a data set. Then for r; one obtains distribution 
functions, denoted by Ff. and Ff?., for condition A and B 
and the following hypotheses: 

(6) 
(7) 

A gene i is called differentially correlated when Hq is 
rejected. 

Gene-pair analysis 

The functional activities of genes, as measured by gene 
expression values, reflect the interplay of the genes and 
their products in the underlying gene network. The objec- 
tive of a gene-pair analysis is to identify either differential 
co-correlated or differential co-expressed pairs of genes, 
instead of individual genes. The reason for looking for 
pairs of genes is that the concerted changes in genes is due 
to their common membership in biological pathways. 

The principle idea to detect correlation changes in gene- 
pairs is visualized in Figure 3D. The data are sampled 
from a multivariate normal distribution with a constant 
mean vector for both conditions, fi\ = fi 2 = (0, 1), but 



H 0 : F r . = F r . 
Hi: 1**1* 



a different correlation of p\ = 0.8 and p\ = —0.2. The 
point is despite no difference in the mean expression of 
the gene-pair, there is a difference in their correlation. 

In [56] a method (CorScor) has been proposed to iden- 
tify such gene-pairs. In Figure 4 we show three cases of 
the joint distribution of expression values of two genes, for 
two conditions. In this Figure we are showing simulated 
data for three possible changes in the co-expression of a 
pair of genes in two conditions. The samples in Figure 4A 
and B are drawn from a multivariate normal distribution 
with fii = {5, 5} and fi2 = {5, 7}. For Figure A the corre- 
lation between gene-pairs is pi = p 2 = 0.9 and for Figure 
B it is p\ = p2 = —0.9. For Figure 4C the samples are 
generated from a multivariate normal distribution with 
fii = {5, 5} and fi2 = {5, 5} and the average correlation 
between gene-pairs is p\ = 0.9 and p 2 = —0.3. 

In the first two cases (Figure 4A and B) the correla- 
tion of the gene-pairs show a condition specific shift, in 
[56] denoted as a gap and substitution. In the third case 
(Figure 4C), the gene-pairs show a reversed correlation 
between the two conditions, denoted as on/off case. To 
identify gene-pairs in these two types of conditions, two 
scoring functions have been suggested in [56] given by: 



s = 



\pA + Pb — &p\ gap/substitution case 



\PA ~ Pb\ 



on/off case 



(8) 



Here each of the three correlation coefficients are esti- 
mated for a gene-pair between gene i and i.e., pa = 
Pa(Uj) etc. The value of p corresponds to the global corre- 
lation coefficient of the gene-pair over the two conditions 
(A,B) and pa, Pb are the correlation coefficients of the 
gene-pair for condition A and condition B. In Eqn. 8, a is 
a tuning parameter that governs the balance between sep- 
aration and parallel alignment. In [56] it was argued to use 
a value of a = 1.5. The null and alternative hypotheses 
tested are: 

Ho : s = 0 (9) 

Hiis^O (10) 

In [57] the expected conditional F-statistic' (ECF- 
statistic) has been introduced to measure the differential 
co-expression of gene pairs (X, Y). The method is based 
on a modified F-statistic, where the variance and the mean 
parameter of the statistic are estimated from a mixture of 
two normal distributions. 

The R package R/EBcoexpress provides an empirical 
Bayesian implementation to identify the differential co- 
expression of gene pairs [58]. 

Another method called liquid association (LA) has been 
proposed in [59] to identify co-expressed gene pairs. In 
contrast to pairwise correlation measures, the LA method 
considers the presence of a mediator gene, Z, for observ- 
ing a co-expression between two genes at a given cellular 
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Figure 4 The joint distribution of the expression of two genes. In 

A and B the expression values of the two genes show a shift in the 
two conditions (red and green). In C the expression values of a gene 
pair are anti-correlated between the two conditions. In A-C the lines 
correspond to linear regression fits of the colored expression values 
and the color dots surrounding the figures provide one-dimensional 
projections of the two-dimensional distributions. 



state. Let X, Y and Z be gene-expression profiles. We say 
that X and Y form a liquid association pair (LAP), if the 
cellular state of X and Y is correlated with Z. The LA 
score of X and Y with respect to Z is estimated from rank 
transformed expression profiles of X, Y, Z given by 

^ m 

LA(X, Y\Z) = - YXiYiZi (11) 
m ^— ' 

i=i 

Here m corresponds to the number of samples. The LA 
method uses a permutation test for the identification of 
significant LA gene pair values. Due to the high computa- 
tional burden of the method that would require N 3 (N is 
the number of genes) evaluations of Eqn 11 plus additional 
permutations of the data, which is even for only N = 10 3 
genes intractable because it requires already more than 
10 9 evaluations. For this reason the method is only used 
to (A) find the gene Z for a given pair of genes or (B) find 
the LAP, X and Y, for a given gene Z. 

Gene set and pathway analysis methods 

Generally, a pathway is a group of interacting genes (a 
gene set) that deploy a cellular function. In a biological 
system the biological processes are coordinated functions 
of sets of genes which make the organism work. Some 
general pathways are, e.g., metabolic pathways, signaling 
pathways or regulation pathways that represent minimal 
functioning units of a cellular system. The consideration 
of pathways or gene sets for a comparative gene expres- 
sion analysis is an important step toward the exploration 
of relevant functional mechanisms of a cell. 

So far, many multivariate and univariate tests have been 
proposed for a gene set analysis, see Figure 2. Find- 
ing differentially expressed pathways, instead of individ- 
ual genes, is not straight forward from a statistical and 
biological perspective and there are several hurdles to 
this approach. The first is presented by the data them- 
selves, because the number of variables is usually (much) 
larger than the number of samples, i.e., n << p, that 
leads to many estimation problems. The second hur- 
dle is our incomplete information about the constitution 
of biological pathways and the potentially high over- 
lap of genes between different pathways. For example, 
databases like GO [60] provide valuable information about 
genes for a large variety of different organisms. However, 
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this information is not static but continuously expanding 
leaving us at the moment with a snap-shot of knowl- 
edge. This makes it difficult to find precise definitions 
for particular pathways of interest. The third problem 
comes from the underlying gene network structure that 
describes the true interactions between genes in a path- 
way. Here, the problem is that as a result of such interac- 
tions among genes it is usually not appropriate to assume 
their independence, as frequently done for statistical 
ease. 

A motivating example for the general idea underlying 
gene set methods is shown in Figure 3C. For condition 1 
(green), the samples are drawn from a multivariate nor- 
mal distribution with fi± = {2, 2.2, 2.4, 2.6, 5, 8} and for 
condition 2 (red) fi2 = {2, 7, 7.5, 2.6, 5, 8}. The covariance 
matrix, Ei = X 2 is for both conditions the same. In this 
Figure, only 2 of the 6 gene are differentially expressed. 
This reflects biological situations because, usually, only of 
fraction of the genes belonging to a pathway is found to 
be differentially expressed. However, due to the fact that 
gene set methods are based on the expression of a set of 
genes such methods borrow strength from the combined 
analysis of the genes. 

Reviews that focus entirely on gene set and pathway 
analysis methods can be found in [61-65]. 

Null hypothesis for gene set analysis 

Gene set analysis methods can be broadly divided into 
two major categories, depending on what null hypothesis 
is tested. The first type of methods are called competi- 
tive methods, and the second type self-contained methods 
[66] . Briefly, self-contained tests use only the data from a 
target gene set under investigation, whereas competitive 
tests use, in addition, also data outside the target gene set 
(background data). In the following we describe popular 
competitive and self-contained pathway methods. 



Competitive gene set and pathway methods 

In Table 1 we show an overview of gene set and pathway 
methods, described in the following. 

GSEA 

The gene set enrichment analysis method (GSEA) [27,68] 
is one of the most widely used competitive test based 
method. The test uses a Kolmogorov-Smirnov test statis- 
tic to identify differential expressed gene sets. The gene set 
and background data set are denned in the following. Let 
W be the target gene set to be tested and W c its comple- 
ment in a way that the union of both sets defines all genes, 
i.e., V = W U W c , in the data set. The hypotheses tested 
by GSEA with respect to an enrichment score (ES) are: 

Hq : ES = Oiyanishingtestscore) 

H\ : ES O(non — vanishingtestscore) 

Briefly, GSEA consists of the following steps, applied to 
each pathway: 

(1) Estimation of gene-level test statistics. 

(2) Rank ordering of the test statistics. 

(3) Calculation of an enrichment score (ES) for a 
pathway based on the gene-level test statistics. 

(4) Permutation of the gene-labels to estimate the 
significance of the enrichment score for the pathway. 

GSEArot 

GSEArot (gene set enrichment analysis rotation) [70] is 
very similar to GSEA, but uses a different approach to 
randomize data in order to assess the significance of a 
target pathway. More specifically, a data matrix X is ran- 
domized by, first, rotating X around a random angle 5, 
resulting in a matrix X(8). Second, from the matrix X(<5), 



Table 1 Overview of different competitive gene set methods 


Principle Method 


Reference 


Test type 


Software 


Over-representation analysis (hyper- 
geometric test) 


[67] 


parametric 


GOstats 


GSEA 


[68] 


non-parametric 


GSEABase 




[27] 


non-parametric 


www.broad.mit.edu/gsea 


GSA 


[69] 


non-parametric 


http://cran.r-project.org/web/packages/GSA/ 


GSEArot 


[70] 


non-parametric 


limma 


GAGE 


[71] 


parametric 


GAGE 


PAGE 


[72] 


parametric 


PGSEA, GAGE 


Random Set 


[73] 


parametric 


part of CLEAN 


Generalized Random Sets 


[74] 


parametric 


http://GenomicsPortals.org 


Gene set enrichment analysis made 


[75] 


parametric 





simple 



If available, the name of the software package is provided. 
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the randomization matrix is obtained by a QR decompo- 
sition [76]. In [70] it is argued that this procedure has an 
advantage for small sample sizes, when only very few per- 
mutations are achievable from sample-label permutations. 
The null hypothesis tested by GSEArot is the same as for 
GSEA. 



Random set 

The random set method introduced in [73] is a paramet- 
ric test that is a generalization of Fisher s exact test in the 
sense that enrichment scores of gene sets are compared 
with randomly formed sets. The enrichment scores are 
based on single gene-level test statistics reflecting their 
differential expression. 

1. Estimate the enrichment score of a target gene set W, 
m ^ 



ieW 



Here s; are gene-level scores, e.g., t-scores, and 
m = | W\ is the number of genes in the target 
pathway. 

2. Estimate the enrichment score and its variance of the 
background gene set V = W U W c , 



11 = \H l 



(13) 



ieV 



m\p-\)\ p V P ) / 



with p = \ WU W c \. 
3. Estimate the standardized enrichment score 



Z = 



S — [A 

a 



(14) 



(15) 



The Z score follows a standard normal distribution 
under the null hypothesis Ho given by: 

Ho : The target set W is not enriched for differentially 
expressed genescompared with W U W c 

(16) 

H\ : The target set W is enriched for differentially 
expressed genescompared with W U W c 

(17) 

It is notable that Z can be calculated without a numer- 
ical randomization of the data. Further, the background 
data consist of all genes V, including the ones in the target 
pathway W. In [77] this method has been applied to head 



and neck and cervical cancer for human papillomaviruses- 
positive and -negative samples. 

GAGE 

GAGE [71] (generally applicable gene set enrichment) is 
also a parametric test that, similarly to GSEA, compares 
the expression in a target gene set with that of the back- 
ground. But instead of using a Kolmogorov-Smirnov like 
test [78] it employes a two-sample t-test. The principle 
steps of the method are as follows: 

1. Estimate the mean fold change f and its standard 
deviation oj for the m genes in the target pathway W. 

2. Estimate the mean fold change f and its standard 
deviation Of for all p genes in the background gene 
set V = W U W c . 

3. Estimate the t-score: 

/-/ 



t = 



yjcfflm + crjjm 



(18) 



with 



df = (m-l) K \ ' 



oj + a J, 



(19) 



degrees of freedom. 



Also GAGE employs all genes V in the background gene 
set, including the ones in the target set W. The underly- 
ing assumption of GAGE is that the (mean) fold changes 
of genes are independent and identically distributed. The 
null hypothesis tested by GAGE is: 



Ho : The mean fold change of genes (MFG) in set 
W is not different to the MFG in WUW C 



(20) 



Hi : The mean fold change of genes (MFG) in set 
W is different to the MFG in W U W c 



(21) 



GSA 

Another method is GSA (gene set analysis) [69]. The 
method, first, calculates z-scores, zu with i e {1, . . . , m}, 
for all m genes in a given target pathway W. Then each 
z-score is transformed into two scores, assessing the sign 

Of Zi. 



5 + (z) = max{z, 0} 



s (z) = — min{z, 0} 



(22) 



(23) 
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This results in two sets of non-zero scores <S + = 
. . . ,s+(z m )} and S~ = {s~(z\) t . . . ,s~(z m )} from 
which their mean value is calculated, 



s = mean 



s = mean 



(<S+) = -£s+fe) 
m ^ 

i 



(24) 



(25) 



Finally, the maxmean test statistic is denned by s mm = 
max{s + , giving the test statistic for the target pathway. 



H 0 : s mm (W) = s mm (W c ) 



(26) 



H l :s mm {W)^s mm (W c ) (27) 
The null distribution is assessed by a restandardization, 
combining a sample- and gene-label permutation. 

Self-contained gene set and pathway methods 

In Table 2 we show an overview of self-contained gene set 
and pathway methods that are describe in the following in 
more detail. 

Sum oft-square 

The sum oft-square test is an univariate test that is based 
on t-scores, {£;}, individually obtained for each of the m 
genes in a given set [95], see also [79]. That means that 
each t-score assesses the difference of the mean expression 
between the two conditions, 
Axi - A/jii 



ti = 



Si 



(28) 



Table 2 Overview of self-contained gene set and pathway 
methods 



Principle Method 


Reference 


Software 


Average of single-gene statistics 


[79] 


sigPathway 


Linear Model Toolset for GSEA 


[80] 


GSEAIm 


SAM-GS 


[81] 




Globaltest 


[82] 


globaltest 


GlobalANCOVA 


[83] 


GlobalAncova 


Hotelling's T 2 


[84-87] 


PC0T2 


N-statistic 


[88] 


cramer 


RCMAT 


[89] 




Non-linear tests for identifying 
differentially expressed genes or 
genetic networks 


[87] 




Pathway-express 


[90] 




Signaling Pathway Impact Analysis 


[91] 


SPIA (Bioconductor) 


SEPEA 


[92] 




PARADIGM 


[93] 




Gene set analysis exploiting the 
topology of a pathway 


[94] 


IPS (available upon 
request) 



with si the pooled standard deviation. The test statistic for 
a pathway is based on the individual t-scores given by 



TS = 



N 



(29) 



Because for each gene A/x; = 0 should hold if a gene 
is not differentially expressed, the null and alternative 
hypothesis can be formulated as: 

Ho : TS = 0 (vanishing test score) (30) 

Hi : TS ^ 0 (non-vanishing test score) (31) 

The significance of TS is assessed from sample-label 
permuted data. 

SAM-GS 

The method SAM-GS (Significance Analysis of Microar- 
ray for gene sets) [81] uses the test statistics, 



SAM-GS = J2 d l> 

k=i 



(32) 



with dk = ♦ Here X\k and x<ik are the sample 

means of the control and treatment condition of gene /c, 
Sk corresponds to its pooled standard deviation and so 
is a constant for a sensitivity adjustment. The null and 
alternative hypothesis can be formulated as: 



Ho : SAM-GS = 0 (vanishing test score) 



(33) 



H\ : SAM-GS ^ 0 (non-vanishing test score) (34) 

Statistical significance of SAM-GS is again assessed 
from sample-label permuted data. 

Hotelling's T 2 

The Hotelling T 2 test is a self-contained test that is a mul- 
tivariate generalization of the univariate t-test. Its null and 
alternative hypothesis can be formulated as: 

Ho : fi T = fi C (equality in the m-dimensional 
population mean vectors) 



(35) 



Hi : ft 7^ ft (difference in the m-dimensional 



(36) 



population mean vectors) 

Suppose we have two groups with nc samples from 
the control group and nj samples for the treatment 
group, each consisting of m genes. Let the expression 
level of the i th sample of the control group and treat- 
ment group be given by Xf = (xf v Xf v . . .Xfirf and 

Xf = (xJ v Xf 2 , . . .X^y, respectively. The pooled covari- 
ance matrix S is then denned by 

(n T - 1)St + (nc - l)Sc 



(n T + nc - 2) 



(37) 
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where Sc and St are the covariance matrices for the con- 
trol and treatment group. Hotelling's T 2 is denned as 



t2 = n T x n c 



. , i. . , (38) 

rir + nc 

The inverse of the covariance matrix is estimated via the 
shrinkage estimator [96-99]. The statistical significance 
of the test statistic T 2 is estimated from sample-label 
permuted data. 

N-statistic 

The N-statistic is a non-parametric test that is used 
to test the equality of two distributions. Suppose the 
expression level of the i th sample of the control group, 
nc, and the treatment group, nj, is given by Xf = 

• • - X ?J and X J = ( x l x l> ■ ■ - X U< res P ec - 

tively. Let / G {1 . . . yiq\ correspond to the control data-set 
and i e {1...ht} to the treatment data-set. The null 
and alternative hypothesis tested by the N-statistic can be 
formulated as: 

H 0 : F c ix) = F T (x) (39) 

H x : F c (x) £ F T (x) (40) 

whereas Fcix) and Fj(x) are two multivariate distribution 
functions from the control and the treatment condition. 
The N-statistic itself is denned as follows: 



N : 



-;tEE*(*7W) 



nc^T 



l c i=l j=i 



-.1/2 



T i=l ;=1 



^7 



(41) 

.ell 



Here K (xf, xj^j, denned as K (xf, x^ = 
is the Euclidean Kernel serving as distance function 
between the expression values in the two conditions. 

Linear model-based pathway methods 

There are also several approaches that utilize either a lin- 
ear or a generalized linear modeling framework for a gene 
set analysis. Examples for such methods are Global test 
[82], Extension of GSEA [80] or GlobalAncova [83]. 

Topological pathway methods based on existing network 
information 

Some recent univariate methods, for instance, Pathway- 
express [90], SPIA [91] or SEPEA [92], use instead of cor- 
relation measures to estimate interactions among genes, 
predefined topological information as provided, e.g., by 
the KEGG database [100]. These methods assign each 
gene in a pathway a score that is based on the posi- 
tion of a gene in the given network structure and, finally, 



aggregate these individual gene scores to obtain a score 
for the pathway itself. Yet another approach is provided 
by PARADIGM [93]. This method uses a factor graph 
model combining gene copy number variation data with 
gene expression data for the identification of differentially 
expressed pathways. 

Iterative Proportional Scaling: IPS 

IPS (Iterative Proportional Scaling) [94,101] is another 
method that utilizes the topology of pathways of a given 
gene set by testing the hypotheses: 

H 0 : X" 1 = E" 1 assuming^- 1 , E" 1 e S+(G) (42) 

H x : H- 1 # E- 1 assuming^- 1 , X" 1 e S + (G) (43) 
In this method, the covariance matrices, X Cl ,X C2 , are 
estimated from the data, for both conditions, using the 
Iterative Proportional Scaling (IPS) algorithm. The inverse 
of the estimated covariance matrices are positive definite 
(concentration) matrices for which it is assumed that the 
non-zero elements in E" 1 and X" 1 are identical; this is 
the meaning of E^X" 1 e S+(G) where S + indicates 
the class of all symmetric positive definite matrices with 
non-zeros elements given by the binary matrix G. This 
means that the concentration matrices, X" 1 and E^, 1 , 
have identical zero element, but are allowed to have dif- 
ferent non-zero entries. In other words, it is assumed that 
the underlying topology of a pathway is the same for con- 
dition c\ and C2, given by G, whereas Gy = 0 corresponds 
to an absent' interactions among the genes i and y. Since 
the structure of G is not estimated from the data, it is nec- 
essary to obtain it from an independent source, e.g., from 
the KEGG database or Reactome [100,102]. In [94] it is 
shown that a log likelihood (log(A)) ratio test can be used 
to test for the equality of the concentration matrices for 
the two conditions and that asymptotically the log likeli- 
hood ratio follows a Chi-square distribution with r + m 
degrees of freedom, i.e., log(A) ~ X?+ m > whereas m is 
the number of genes in the pathway and r is the number 
of non-vanishing edges in G corresponding to the fixed 
interaction structure of the pathway. 

The IPS method has been used in [94] to study acute 
lymphocytic leukemia with and without BCR/ABL gene 
rearrangement. As a result, the JUN oncogene with 
RAS/MAPK/JNK followed by NFAT and NFKB seem to 
be crucial in distinguishing BCR/ABL positive and nega- 
tive patients. 

Differential correlation/interaction methods 

In the previous sections, we discussed different gene set 
and pathway-based methods for the identification of dif- 
ferentially expressed pathways. These methods focused 
either only on the expression of genes, or considered an 
underlying interaction topology among the genes as taken 
from an independent source. However, even when these 
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methods considered an underlying network structure, this 
structure was assumed to be the same for the 'treatment' 
and control' group. 

In contrast, in this section we discuss methods that esti- 
mate the correlation/interaction structure of the genes 
within pathways, for each experimental condition. The 
underlying rationale for these approaches is to assume 
that the expression profiles of genes are dependent on 
each other [103,104] as the genes in a pathway inter- 
act, either directly or indirectly [105]. This assumption 
results from the observation that genes with similar func- 
tions or cellular localization are often co-expressed and 
cluster together. The methods discussed in this section 
bear a similarity to the statistical methods for the esti- 
mation of differential correlated gene-pairs (see section 
'Gene-pair analysis'). However, the extension of such 
gene-pair measures to the pathway level allows the iden- 
tification of pathways that show, e.g., a condition specific 
correlation change. 

In Figure 3E we show a simulated example scenario 
for condition specific correlation changes of the expres- 
sion profiles for a gene set. In Figure 3E the correlation 
between all gene-pairs of a gene set is aggregated by 
a summary statistic. In this example, the mean values 
between the genes is of a comparable order, whereas the 
correlation of the gene set in the treatment condition is 
reduced. 

A variety of different pathway methods have been devel- 
oped that integrate the estimated gene correlations or 
co-expression structures with gene expression data. A 
summary of different methods that are used for the iden- 
tification of differential correlation/interaction changes 
in pathways is shown in Table 3. In the following these 
methods are described in more detail. 

Graph Edit Distance: GED 

Among the first approaches that estimate the interaction 
structure for a pathway analysis to identify differentially 
correlated pathways (DCP) is a method introduced in 
[106]. This method uses the graph edit distance (GED) 
score as a test statistic. 

More precisely, for a given pathway containing m genes 
an association graph, G, also called pseudo-pathway, is 



inferred for each condition. That means the resulting 
network comprises the m genes of this pathway only. 
The inference method estimates correlation and partial 
correlation coefficients and tests their statistical signif- 
icance. That means, if either the correlation or partial 
correlation coefficient for two genes i and ; in this path- 
way vanishes, then the resulting network will not have an 
interaction between gene i and y, i.e., = 0, otherwise 
there is an interaction, Ey = 1. Here E corresponds to the 
adjacency matrix of the network. Suppose G Cl and G C2 are 
two networks that have been inferred from gene expres- 
sion data for condition c\ and C2. Further, assume that 
Mi,M2, . . . , M n are all possible transformations that map 
G Cl into G C2 , i.e, M/(G Cl ) = G C2 . Then the optimal cost 
of the optimal transformation, M f , is given by c(M f ) = 
min{c(M/)|l < i < n). This value is used to define a dis- 
similarity measure <3?ged(G Ci , G C2 ) = c(M ) between the 
two networks G Cl and G C2 , called the graph edit distance 
(GED) score [112]. For arbitrary networks, G Cl and G C2 , 
the estimation of <3?ged(GU> Gb) is numerically challeng- 
ing. However, for our specific problem it can be efficiently 
calculated based on the adjacency matrices, E Cl and E C2 , 
of the two networks, 

2 m 

d G ^(G Cl ,G C2 ) = -j—^ J2 |£* -2*| . (44) 

In [106] the GED score has been used as a test- statistic 
for the formulation of the hypotheses: 

H 0 :d GED (G Cl ,G C2 ) = 0 (45) 

Hi :d GED (G Cl ,G C2 ) ^0 (46) 

In order to assess statistical significance, sample label 
permutations are performed to obtain the null distribu- 
tion. 

Extensions of this method can be found in [113] where 
mutual information values have been used to capture non- 
linear relations among gene expression values. Further, in 
[114] a methods based on a relevance value (RV) has been 
defined for integrating different types of genomics data 
sets which has also a resemblance to the GED. 



Table 3 Overview of methods for the identification of differential correlation/interaction changes in pathways 



Principle Method Reference Software 



Graph edit distance 


[106] 




Gene-set co-expression analysis (GSCA) 


[107] 


GSCA (http://www.biostat.wisc.edu/~kendzior/GSCA/) 


Differential co-expression (dCoxS) between gene-sets 


[108] 


dCoxS (http://www.snubi.org/publication/dCoxS/index.html) 


DiffCoEx 


[109] 


R code is provided in the paper 


Differential disease network using C3NET 


[106,110] 


c3net (http://cran.r-project.org/web/packages/c3net/index.html) 


Disease associated interactions using Synergy network 


[111] 


MATLAB code is provided in the paper 
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Gene set co-expression analysis: GSCA 

A method that is based on (zero-order) correlation coeffi- 
cients is gene set co-expression analysis (GSCA) [107]. This 
method uses as test statistic the dispersion index, which is 
denned as follows: 



(47) 



N *=i 



Here p c k , with c e {c\, c<i\, is the k-th correlation coeffi- 
cient for a gene pair, which can be formed among the total 
number, P = (^), of such pairs for a pathway consisting of 
m genes. The null and alternative hypotheses tested are: 



H 0 :D s (p c \p C2 ) = 0 



h i: a(p c1 ,p C2 )^o 



(48) 



(49) 



From the definition of the dispersion index follows that 
also this method aims at detecting at differential cor- 
relation among pathways, despite its name emphasizing 
co-expression. Interestingly, the dispersion index corre- 
sponds to the GED score if its components in Eqn. 47 
are re-labeled and one defines the components of the 
adjacency matrices E Cl ,E C2 as the correlation coefficients 
rather than the outcome of the hypotheses tests [106]. 

A visualization of the underlying idea of GSCA is 
shown in Figure 3E. The gene expression values are sam- 
pled from multivariate normal distribution N(fii, Xi) and 
N(/i2t £2) w ^ n ^1 = ^2> and the average covariance 
between gene-pairs is Ei = 0.8 and X2 = —0.2. Despite 
the fact that there is neither a difference in the individ- 
ual expression of genes nor the the expression of a set of 
genes, condition 1 and 2 can be distinguished by using a 
measure based on a correlation change. 

Partial least squares based scores: PLS 

A statistical framework based on a partial least squares 
score is proposed in [115]. Similar to the above methods, 
two matrices for the two conditions are inferred. These 
matrices can be seen as weighted networks, whereas an 
edge weight corresponds to the strength of the association 
between two genes. In this paper, three different types of 
tests are introduced that allow (A) testing for changes in 
the module structure of the two networks, (B) testing for 
changed in the connectivity of a particular gene set, and 
(C) testing for changes in the connectivity of a particular 
gene. 

Differentially co-expressed gene sets: dCoxS 

In [108] the differentially co-expressed gene sets (dCoxS) 
algorithm is proposed. This is an entropy-based method 
that uses the interaction score (IS) to measure the dif- 
ference between two pathways. The IS is estimated by 



the correlation coefficient between the entropies of the 
two pathways. The entropies themselves are estimated by 
using the Renyi relative entropy, which is defined by: 



D a (P II Q) = log (J (p a q X - a ) dp dq^j «log 



) 
) 

(50) 



Here a is a parameter and fh(Si) and fh(Sj) are expres- 
sion densities of the samples i and estimated by using 
a multiplicative kernel for the density estimation. Further, 
p and q are the probability density functions of P and Q. 
From this, the IS is estimated by: 



IS = 



J2i<j( REPl -RE P2 )(RE Pl — RE ?2 ) 



(51) 

In this equation, RE Pl and RE ?2 are entropy matrices 
of two gene sets, Pi and P2, for condition c,-, whereas 
each component of the entropy matrices is proportional 

to ~ log^^. That means, strictly IS = IS(Pi(c i ) ) P 2 (c i )). 

fh(Sj) 

Application of Fishers Z-transformation to IS results in 
a z-score, z(^IS(Pi(ci),P2(ci))^, for condition c,-. Combi- 
nation of both z-scores for condition c\ and C2 leads to, 

z(^Is(p 1 (c 1 ) } P 2 (c 1 ))^ -z^IS^P 1 (c 2 ) ) P 2 (c 2 )Jj 



Zcomb : 



VV(«i - 3) + l/(« 2 - 3) 



(52) 



Here n\ and n 2 correspond to the number of samples 
for condition c\ and C2. The interpretation of the null 
hypothesis tested can be stated as: 

^0 ♦ Zcomb — 0 (equality in entropy changes between 
gene-pairs in the pathways Pi and P2 
between the conditions cj and ci) 

(53) 

H\ : Zcomb 7^ 0 (difference in entropy changes between 
gene-pairs in the pathways Pi and P2 
between the conditions ci and ci) 

(54) 

In [108] dCoxS has been applied to gene expression data 
from lung cancer. Their analysis identified the Thrombin 
signaling and protease- activated receptors pathway, which 
is known to be involved in the angiogenesis of lung cancer, 
as the most frequently changed pathway. Another inter- 
esting result found is that all significant pathway pairs had 
a lower interaction score in lung cancer than in the normal 
control group. This might indicate that the variability in 
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form of exploited parallel pathways is in cancer lower than 
in normal cells. 

Gene regulatory networks 

Finally, we would like to mention that also gene regula- 
tory network inference methods have also been used in 
this context. More precisely, several attempts have been 
made to identify disease networks [110,111] that corre- 
sponds to particular pathways. For instance, in [110] the 
C3NET inference method [116,117] has been used to infer 
pathway specific networks for prostate cancer. A struc- 
tural comparison between the pathway- specific networks, 
similar to [106] which is based on testing the hypothesis in 
Eqn. 45, allowed to identify growth and cell cycle related 
pathways. 

On a side note, we would like to add that Gaussian 
graphical models (GGM), also known as Markov random 
fields [118-120], are also frequently used to infer gene reg- 
ulatory networks. This model assumes that all variables 
follow a multivariate normal distribution with a specific 
structure of the inverse of the covariance matrix, & = 
X -1 , whereas & is called the precision or concentration 
matrix. Network inference methods based on GGM make 
use of the relation, 



PijW\ffl = 



(55) 



connecting the partial correlation coefficient of full-order 
(LHS) with the elements of il, coij e ft. The partial correla- 
tion is of full-order (with respect to the number of genes) 
because V\{ij) is the set of all genes excluding i and i.e., 
the largest possible set of genes not considering i and /. 

Several methodological improvements have been sug- 
gested to infer gene regulatory networks based on GGM 
[121-123]. These methods differ in the way the inverse of 
the covariance matrix, X -1 , is estimated and in the sta- 
tistical tests employed to assess significance. The reason 
for these technical variants comes from a variety of prob- 
lems. For instance, if the number of samples is smaller 
than the number of genes, which is typically the case 
for a microarray data set, the sample covariance matrix 
is not positive definite and, hence, not invertible. This 
means that Eqn. 55 cannot be exploited. In order to over- 
come such practical estimation problems, recently, several 
extensions based on the LASSO (least absolute shrinkage 
and selection operator) have been suggested [124-128]. 

Importance of multiple hypotheses testing and 
sample size: An example for differentially 
expressed genes 

Typical microarray experiments measure the concentra- 
tion of thousands of mRNAs simultaneously. For this 
reason, usually, one does not just test one statistical 
hypotheses but dozens, hundreds or even thousands. This 



makes it mandatory to control the overall error rate for 
all the tests, because the probability to make at least one 
error, Pr(V > l\a,t) = (1 — (1 — a)*), for a test with a 
false positive rate of a, increases rapidly with the number 
of tests t, as can be seen in Figure 5. Here, V corresponds 
to the number of false positives. From this one can see 
that even for a moderate number of tested hypotheses, 
e.g., 300, this probability is already almost 100%. Hence, 
each of the three principle types of hypotheses tests dis- 
cussed in the previous sections are severely effected by 
this problem. 

Classically, a Bonferroni correction is used controlling 
the Family wise Error Rate (FWER) [129,130], 



FWER{a) = P(V > 1). 



(56) 



Unfortunately, this method is often too stringent, which 
may give no significant results at all. For this reason, alter- 
native error measures and control procedures have been 
introduced. A recent, very popular measure is the false 
discovery rate (FDR) [131], 



FDR = 



E[ V/R] 
0 




(57) 



controlled by a procedure introduced by Benjamini & 
Hochberg (BH) [131]. Subsequently, various related error 
measures have been proposed like pFDR [132], local 
FDR [133,134] and a variety of other control procedures 
[129,135]. Also extensions have been suggested [136] that 
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allow the control of an error measure in cases where the 
underlying tests are not independent from each other. 
This is particularly important for microarray data that 
contain a none neglectable correlation structure among 
the genes. 

In order to demonstrate the importance and the influ- 
ence of different error measures and control procedures, 
we provide a numerical example, shown in Figure 6. Here, 
'BH' means the Benjamini & Hochberg procedure to con- 
trol the FDR [131], pFDR' indicates the positive false pos- 
itive rate controlled with a method introduced in [132], 



'Bonferrom corresponds to the FWER controlled with a 
Bonferroni correction and localFDR' corresponds to the 
local FDR [134,137]. The local FDR is defined as, 

localFDR = Prob(null |test statistic). (58) 

that means the local FDR is the probability that the null 
model is true conditioned on the observed test statistic. 
The data we used for this analysis correspond to simulated 
gene expression data sampled from a normal distribution 
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with different mean values for the two conditions. More 
precisely, we simulate 2000 genes of which 400 are dif- 
ferentially expressed (true positives). Further, we study 
three different (constant) correlation structures with p = 
{0.0, 0.2, 0.5}. The results shown in Figure 6 are for each 
sample size averaged over 50 independent runs. 

As one can see, 'BH' and pFDR' give more significant 
results and, hence, have a higher power than the Bonfer- 
roni correction and the local FDR when there is either no 
or only a moderate correlation among the genes. However, 
it is important to note that the utility of these methods 
depends on the characteristics of the data. For example, if 
the average correlation in the data is p = 0.0, then pFDR' 
tends to perform best (see Figure 6A). However, when the 
average correlation in the data increases (p = 0.5) then 
the localFDR' [134,137] becomes preferable. We want to 
note, for a sample size of 5, the power of the methods is 
usually very low because only a couple of genes test sig- 
nificant. In addition, a large fraction of these can be false 
positives. This seems to be especially for the local FDR 
method a problem. 

Recommendations 

In general, there is a trade-off between a high power of 
a statistical method on one side, which requires a large 
number of samples, and low experimental costs on the 
other. For the identification of differentially expressed 
genes the results in Figure 6 provide some guidelines. Even 
for the most favorable condition (for p = 0.0) a study will 
usually be underpowered for < 20 samples, however, on 
the other hand, even for 10 samples the Type I error will 
be well-controlled. 

For gene set and pathway-based methods such recom- 
mendation are more delicate. In [105] two self-contained 
(sum oft-square and Hotellings T 2 [84,95]) and one com- 
petitive test (GSEA [27]) have been analyzed. As a results, 
it is suggested not to apply a method unconditionally to 
all pathways in a given data set, but to filter them in order 
to eliminate conditions for which a method is more likely 
to cause problems. This can be seen as a reflection of the 
heterogeneity of cancer, as discussed above in the section 
'Gene expression data and cancer! 

In [105] it has been suggested to filter pathways accord- 
ing to the following criteria: Hotellings T 2 should only be 
applied to pathways with less than 35 genes and a sample 
size larger than 30. The sum of t-square test should only 
be used for pathways with DC > 10% (DC is detection 
call; the percentage of differentially expressed genes in a 
pathway) and a sample size of 25 or larger. GSEA should 
only be used for pathways with DC > 10% and a sample 
size larger than 25. That means for the sum oft-square test 
and GSEA, at least 10% of the genes in a pathway should 
be differentially expressed for the method to work. How- 
ever, this is not independent of the correlation structure 



of the data. In general, in the presence of high correla- 
tions a larger number of differentially expressed genes is 
beneficial for these methods. 

It is important to emphasize that these sample sizes 
are different to the minimal sample sizes necessary in 
order to avoid in addition that a study is underpowered. 
For the minimal sample sizes [105] predict a sample size 
of 59 for Hotellings T 2 and 57 for the sum of t-square 
test and 83 for GSEA. Further, in [95] it was found that 
using the N-statistic with 40 samples (or more) leads to a 
good control of the Type I error and a satisfactory power 
for a variety of differing conditions, including different 
correlations of the data and DC values in the pathways. 
Further studies reviewing related methods can found in 
[61,62,65,139,140]. 

We would like to emphasize that the above recommen- 
dations are data dependent. That means it is not possible 
to judge solely based on the number of samples which 
method to use. Instead, one needs to estimate charac- 
teristics from a particular data set in order to select an 
appropriate test. This implies, e.g., to estimate the cor- 
relation structure and the detection call. In the context 
of cancer there is an additional problem that needs to 
be considered. It is known that a tumor is a heteroge- 
neous collection of cells rather than a homogeneous one 
[26,141]. This translates into the heterogeneity of gene 
expression data [142] making it even more dangerous to 
provide general recommendations without considering a 
particular data set. 

On a general note, we would like to highlight that when- 
ever a given data set allows to (I) identify changes in single 
genes, (II) identify changes in gene sets or pathways, and 
(III) identify changes in the correlation structure in path- 
ways, then methods from each of the three categories (l-lll) 
should be applied and there is no need to focus on just 
one of these. The reason for this is that despite the fact 
that gene set or pathway methods have more explanatory 
power than methods to identify changes in single genes 
[64,95] it does not mean that there are no conditions for 
which single gene methods reveal interesting biological 
information that may not be obtained by the other types 
of methods. For instance, the differential expression of a 
single gene based on changes in the mean (rather than the 
variance) may be an indicator for the presence of a sin- 
gle signaling chain rather than of many parallel pathways. 
Hence, this could provide information about the presence 
of a Mendelian trait or a complex trait that contains a 
strong monogenetic component. It appears that for such 
conditions single gene methods have an advantage over 
gene set or pathway methods, although, the latter meth- 
ods may be adaptable to such question as well. However, 
this may require additional effort. In summary, we rec- 
ommend to use all different approaches (l-lll) side-by side, 
whenever this is permitted by the data, to interrogate the 
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data in the broadest way, because this translates into a 
diverse set of different biological questions. 

Our recommendation complements a common line of 
thought asking for the combination of different types of 
data. Although it is certainly true that combining different 
types of high-throughput data, e.g., from DNA microarray 
and ChlP-chip experiments, is in general more informa- 
tive, it is also more time and cost intensive to generate 
such data combinations. For this reason, frequently, only 
gene expression data are available. Hence, our review pro- 
vides a survey of method to get the most out of expression 
data sets. 

Finally, we would like to emphasize that all methods 
require an appropriate filtering and normalization of the 
data in order to obtain robust and statistically sensible 
results [143,144]. 

Conclusions and discussion 

In the post genomic era, biology transitioned from a 
'gene-centric' to a systems-focused field. This change is 
also reflected in the transition from methods to identify 
'differentially expressed single genes' to approaches for 
finding 'differentially changed pathways! Such a transition 
is natural, because a systems view is required to under- 
stand the complex biological functions inside a cell that 
are responsible for the observable phenotypic outcomes 
[9,11,145]. 

As recent findings in cancer research demonstrate, 
cancer is a heterogeneous disorder, even within a par- 
ticular cancer type. For example, breast cancer is cur- 
rently subcategorized into four major tumor subtypes 
[146]: basal-like, HER2-enriched, luminal (which can be 
further reduced) and normal-like tumors. Considering 
the fact that these results have been achieved by using 
high-throughput data one can expect further refinements 
when data from different hight- throughput technologies 
become available and being combined with each other. For 
this reason, it appears sensible to assume such a hetero- 
geneity not only on the global, phenotype level, but also 
within the cells, on the pathway-level. This implies that a 
pathway-based filtering, as suggested in [105], is necessary 
to apply a method only selectively, and not unconditional, 
to cancer pathways. 

Regarding potential future directions, we expect to see 
an increase in methods that target changes in the cor- 
relation structure in pathways for three reasons. First, 
genes and their products do interact with each other. This 
implies that there exists a correlation structure among 
these entities that represents, potentially, useful biolog- 
ical information that may be missed by co-expression 
based methods [106]. Second, the costs to generate high- 
throughput data are declining, which makes it easier for 
the experimenter to generate a sufficiently large number 



of samples that enables such an analysis. This is an impor- 
tant point, since the required sample sizes for a pathway 
analysis is considerably larger than for single gene analy- 
ses. Third, biologically, the hallmarks of cancer point to a 
few pathways as pivotal elements in the molecular elucida- 
tion of carcinogenesis, e.g., Wnt/Notch signaling, Hedge- 
hog signaling or DNA damage control [147-149]. Hence, 
semantically, pathway studies enable the systematic con- 
nection of oncogenes, tumor-suppressor genes and stabil- 
ity genes [150] to provide fundamental insights into causal 
mechanisms underlying cancer. Unfortunately, the tempo- 
rary literature especially of methodological papers discuss 
their results rarely in the framework of the hallmark path- 
ways. For this reason, we suggest that future studies aim 
for a conceptual discussion of their results within this 
enlightening framework. Not because it provides the final 
answers to understand cancer [151], but due to the fact 
that it enables a systematic approach to the emperor of all 
maladies [152]. 

Reviewers' comments 

First of all, we would like to thank all referees for their 
fruitful suggestions and comments. In the following, we 
kept our answers to the raised issues short but included 
our responses in the main text. 

Referee 1 : Dr. Arcady Mushegian 

The manuscript by Emmert-Streib and colleagues is a 
review of statistical methods for analysis of gene expres- 
sion data, but it is also much more than that. It is relatively 
rare for the statisticians to review all classes of such meth- 
ods and to give an eminently logical classification not only 
of the techniques on which the methods are based, but 
also of the kinds of questions that are asked when applying 
these methods. This, certainly, is a strength of the work 
and the reason why it should appeal to the biologists that 
would like to have a deeper insight into which methods 
are appropriate to which task at hand. 

I have, however, several comments that rank somewhere 
between suggestions and concerns. Most importantly, the 
authors propose to distinguish three groups of meth- 
ods: those that identify changes in single genes, those 
that identify changes in gene sets or pathways, and those 
that identify changes in the correlation structure in path- 
ways. (By the way, in the Abstract and elsewhere, the 
description of the groups is almost the same as above, 
but "changes" are substituted by "differential changes" - 
is it not a tautology, in particular when there are only 
two samples?). Then, in discussing the first two classes of 
methods, the authors almost in every case give a clear for- 
mulation of the question that is being asked of the data, 
in the form of the statistical hypothesis about the data 
that is being tested. This is an excellent way of explain- 
ing things. Unfortunately, it is not consistently applied: 
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even among these classes of methods, the hypotheses are 
not mentioned, and then, upon discussing the differential- 
correlation methods (pp. 15-18), the hypotheses are not 
explicated at all, except for the IPS method. I think this 
need to be changed, and the null hypotheses need to be 
stated for all methods for which this is possible; and if the 
framework is such that no explicit null hypothesis exist, 
this needs to be discussed, and the applicable intuitive 
formulation be given. 

Reply: 

We appreciate this suggestion and added to all methods 
the definition of their null hypothesis. In addition, we 
extended the discussion in section 'Formulating biologi- 
cal hypotheses' explaining why it can be difficult to find a 
biological interpretation for a null hypothesis and we offer 
some explanation for this. 

Question: 

My other concern is abut Figures 3 and 4. The authors 
never state what the data points there represent. They 
must be expression values for two genes, but how are these 
data collected - are they technical replicates? biological 
replicates? some kind of ordered series? unordered series 
such as for example different drug treatments? Does it 
matter what of the above they are? 

Reply: 

We added an explanation of the data, which are simu- 
lated data to visualize the principle idea underlying some 
methods, to the corresponding methods. 

Question: 

The third shortcoming of the paper is that there is a signif- 
icant disconnect between well-covered methodology and 
the stated goal of discussing the application to cancer biol- 
ogy. In fact, the short discussion about cancer hallmarks is 
an excellent introduction that points out the way in which 
analysis of gene expression can lead to the understand- 
ing of changes in expression of particular ("hallmark") 
pathways. This theme, however, is not followed through. 
Though occasionally we read that such and such method 
was applied to analysis of a particular type of cancer, 
there is never any discussion of what was found in gene 
expression data that allowed an insight into cancer biol- 
ogy. What happens to the hallmark pathways at the level 
of gene expression programs? Which methods have been 
used to support (or maybe question?) which aspects of the 
hallmark hypothesis? Which pathways were predicted or 
shown to be differentially regulated at the transcription or 
mRNA concentration level? 

Reply: 

We agree with the reviewer that 'Which methods have 
been used to support (or maybe question?) which aspects 



of the hallmark hypothesis? ' is an important questions. 
Unfortunately, the methodology oriented literature does 
rarely touch this topic in a clear manner. That means 
in order to extend the paper in this direction we could 
not survey these issues but would need to establish such 
results. Instead, the concern in our paper is to propa- 
gate such an approach in the context of the presented 
methods. A discussion has been added to 'Discussion and 
conclusions! 

Question: 

Finally, there is the question of, if you will, general biol- 
ogy of transcriptional response. It stands to reason, and 
indeed has been occasionally shown, that in order for a 
pathway to be regulated, it may not be necessary to reg- 
ulate all its components at the same (in this case, mRNA 
concentration) level. One may find that the gene prod- 
uct amounts are regulated at different levels, or maybe 
even only one or a few, e.g., rate-limiting, components 
are regulated at all. This would argue that single gene- 
based methods may in these cases provide a better clue to 
the process than pathway-based or gene set enrichment- 
based methods. It would be interesting to know whether 
this has been observed in the cancer datasets. A related 
question is about the rules of thumb in pathway analysis: 
for example, if a typical pathway (network module?) has a 
size of N genes, what is the number of genes in this path- 
way m < N that would still register as an enrichment in 
some of the tests that the authors discuss? 

Reply: 

This is an important point. We included a discussion of 
this to section 'Recommendations! We added also a dis- 
cussion of the danger of general suggestions and motivate 
this by known characteristics of gene expression data from 
cancer. The problem is twofold. First, each method has its 
own characteristics under what conditions it works best. 
Second, data sets from cancer are very heterogeneous so 
that two data sets containing about the same number of 
samples can exhibit a very different correlation structure 
and expression patterns. This holds potentially also for 
different grades of one cancer type. 

Regarding the first question, it appears to us that this 
is related to the presence or absence of parallel pathways 
conveying a molecular signal. If for example no paral- 
lel pathways exist the detection of differentially expressed 
genes can provide a robust way to detect functional 
changes. On the other hand, if there are many alter- 
natives this may not be the case and gene set meth- 
ods appear to be better suited for such a situation. In 
general, this kind of cross-method comparisons are not 
well studied and we are not aware that this has been 
systematically addressed for cancer or other data sets. 
One reason for this is that until recently, most data sets 
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contained less than 20 samples per condition, which usu- 
ally does not permit a robust analysis of gene set or 
pathway methods and once larger data sets became avail- 
able the detection of differentially expressed genes was 
neglected, potentially, due to the erroneous assumption 
that differentially expressed gene set methods include the 
former tests. 

In order to emphasize that it is desirable to apply meth- 
ods from all three different levels simultaneously ((I) iden- 
tify changes in single genes, (II) identify changes in gene 
sets or pathways, and (III) identify changes in the corre- 
lation structure in pathways) whenever a given data sets 
allows this, rather than to focus on just one of these levels, 
we added a discussion to section 'Recommendations! 

Thank you for your suggestions and comments. 

Referee 2: Dr. Byung-Soo Kim 

General comments This is a well organized review of 
recent statistical methods of analyzing microarray exper- 
iment data sets, particularly on cancers, from single gene 
analysis to identifying differential changes in pathway, and 
finally to comparing a given pathway under two differ- 
ent conditions. However, I would like to indicate following 
four points for the possible improvement. (1) Gaussian 
graphical model: From the methodological point of view, 
it is desired to include the sparse Gaussian graphical 
model (GGM) approach for estimating the gene net- 
work under the multivariate normal assumption from a 
microarray data set. For the recent development of GGM 
approach one can include glasso (Friedman, Hastie and 
Tibshirani, 2008; Witten, Friedman and Simon, 2011) 
[125,127], SCAD penalty of Fan, Feng and Wu (2009) 
[124], adaptive lasso of Zou (2006) [128] and Kiiveri (2012) 
[126], among others. (2) Effect of inter-gene correlations 
on the single gene analysis. A series of Efrons recent 
work (Efron 2007a, 2007b) [134,137] discussed in detail on 
how inter-gene correlations could affect the detection of 
differentially expressed (DE) genes in a single gene anal- 
ysis? By including Efrons recent work and his R package 
"loefdr" authors can show how FDR can be used in the 
real data analysis in their Section on "Importance of mul- 
tiple hypotheses testing and sample size: An example for 
differentially expressed genes". (3) Some of the reviews 
are misleading. These are the few examples, (i) The sen- 
tence, at the middle of page 12, "However, in order to 
use a two-sample t-test with equal size of the two sam- 
ples it is assumed that the mean fold change f and its 
standard deviation oj would be the same for a randomly 
selected background set consisting of only m genes, see 
Eqn. 10". Actually ([99], Luo et al, 2009) assumes the i.i.d 
of the fold change of genes to make Eqn 10 have a t 
distribution. Here the key assumption was the indepen- 
dence, which was missing in the aforementioned sentence. 



(ii) p. 14. Eqn 16. In ([126], Tian et al, 2005) no t-square 
statistic was employed, (iii) Eqn 24 of p. 18 does not make 
sense. Authors of ([20], Cho et al, 2009) didn't make it 
clear in their equation (3) what Renyi entropy was when 
the underlying random variables were continuous, (iv) I 
would suggest authors to allocate more space on the work 
of ([90], Massa et al, 2010) which was methodologically 
sound and deserve more coverage than just the IPS algo- 
rithm. (4) Inconsistency of notations. In page 11 authors 
denned p and m to be sizes of the background genes and 
a target gene set, respectively. However, in line 2 of page 
15 "p genes" (which should have been m genes accord- 
ing to page 11 definition) was incorrectly labeled. This 
inconsistency was repeated in N-statistic section of p. 15, 
and also in Eqn 16 in p. 14 and Eqn 22 of p. 17. The 
"p-dimensional..." should be "m-dimensional..." at the bot- 
tom two lines of p. 14. 

Minor Comments 

1. p.2 "Gene expression data from next generation 
sequencing (RNA-seq)". This is an important issue. 
There is no direct relevance, however, with statistical 
methods reviewed in this paper. 

2. p.4. For detecting differential correlation and 
differential variance, it would be better to explain 
why these approaches were taken. For example, in 
([54], Ho et al, 2008) it was clearly indicated that 
changes in expression variability were associated 
with changes in coexpression pattern, which implied 
that DV was a signal rather than a noise. 

3. Legend of Figure 2. "The data is..." should be "The 
data are... ". 

4. p.7. There is no reference of Figures A, B in the main 
text. Also indicate in the legend of Figure 3 what £ is 
in Figure 3E. 

5. p.8. In the legend of Figure 4 what the symbols in the 
outer-panel represent? What do the lines represent? 
It is better to use different notation (A, B) to avoid 
confusion in the main text of the second paragraph 
of p. 9. 

6. p.9 What is "alpha" in Equation (4)? 

7. p.9 line -9. You may include two specific patterns of 
dependence of two genes, namely, type A dependence 
of Klevanov, Jordan and Yakovlev (2006), and hidden 
regulator dependence of Lim, Kim and Kim (2011). 

8. p.15 line -13. "euclidean Kernel" should be 
"Euclidean kernel" (9) 

9. p. 15 line -10. "a either" should be "either". 

10. p. 15 line -8. Author may want to include Tsai and 
Chen (2009) for another reference of Hotellings 
T-square statistic. 

11. p.17. line 15, What are "A" and "B"? 

12. p. 18 line 2. Better to include Lauritzen (1996) as a 
reference of IPS algorithm. 
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13. p. 22. It would be more beneficial for the read to 
move the last paragraph of p. 22 (extended to p. 23) 
to Introduction section. 

References Efron B. (2007a). Correlation and large-scale 
simultaneous signifance testing, J. Amer. Statist. Assoc. 
102:93-103. Efron B. (2007b). Size, power and false discov- 
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and SCAD penalties. Ann. Statist. 3:521-541. Friedman J, 
Hastie T, Tibshirani R. (2008). Sparse inverse covariance 
estimation with the graphical lasso. Biostatistics 9:432- 
441. Lauritzen SL. (1996), Graphical models, Oxford: 
Clarendon Press. Lim J, Kim J, Kim BS (2011). An alterna- 
tive model of type A dependence in a gene set of correlated 
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Reply: 

We revised our text correspondingly and addressed all 
your suggestions. We would like to point out that the 
major goal of our review is not a full coverage of statistical 
details but to provide sufficient information for the reader 
to acquire a basic understanding of major principles and 
assumptions that underly the methods. The problem is 
that if too many detail are presented the paper would 
turn quickly into a formal description which may not be 
appreciated by a biology oriented readership. 

Minor Comments 

1. p.12. line 1: What is N? 

2. p. 15. line -7 -5: "two i.i.d samples of genes..." is rather 
confusing. Luo et al. (2005) assumed the i.i.d of the 
fold change of genes, which was much stronger than 
just assuming equal mean and variance. It is better to 
rewrite this sentence to convey the original material. 

3. p.17. line -1: "p genes" should be "m genes". 

4. p.22. Eqn. (50): What are p and q? What are S t and Sj ? 

5. p.37. Reference 30, p. 38. References 40, 59; The 
journal title should be consistent with Reference 27 
or vise versa. 

6. p.40. References 90, 108: Location of the publisher is 
missing. 

7. p.40. References 93,94; The journal title should be 
consistent with Reference 119. 



8. p.40. Reference 109: Author was duplicated at the 
end. The location and the publisher were missing. 

9. p.41. Reference 117. The article title is missing. 

10. p.41. Reference 118: The location of the publisher is 
missing. 

11. p.41. Reference 133. The journal title should be 
consistent with Reference 27. 

Reply: 

All comments have been addressed and we revised the 
main text correspondingly. 
Thank you for your suggestions and comments. 

Referee 3: Dr. Joel Bader 

This manuscript reviews methods for analyzing gene 
expression data with tests of individual genes, gene pairs, 
gene sets, and networks. The manuscript is strong in cov- 
ering many methods. It would be more helpful if the 
authors also provided a point of view or evaluation of 
methods. Can anything be said about the relative power 
of different approaches, or which have proven to be 
more useful in practice? What about the tradeoff between 
robustness, power, and speed for realistic data? Most of 
the discussion of method choice is generally about sam- 
ple size requirements for all methods rather than method 
choice given sample size. The two parts of the manuscript, 
gene expression and cancer, don't really mesh. Most of the 
methods review is not cancer specific. Possibly of greater 
relevance to cancer are methods that combine different 
types of data. 

The manuscript is generally well written and easy to 
understand, with ample references to the original work 
and to previous reviews. 

Minor corrections 

1. p. 1 'one gene, — > should be ' for open-quote in 
latex, here and elsewhere 

2. p. 2 differnt microarray — > spelling 

3. p. 2 comprises, e.g., mRNAs — > e.g.' doesn't sound 
right here. How about providing a full list: mRNA, 
tRNA, rRNA, and short regulatory RNAs 

4. p. 2 In the third step the reads are mapped to known 
exon sequences of genes/ Are there also de novo 
assembly methods that don't require a template? 
'allows to overcome' — > overcomes 

5. p. 3 allows to measure — > measures. Can also 
mention other advantages: splice variants, sequence 
polymorphisms, no need to design and build a 
custom chip 

6. correspond to: self sufficiency' — > no colon 
between preposition and noun phrase. Can the 
hallmarks be parallel, all start with noun or verb? 

7. p. 9 Eq. 4. How is alpha calculated? 

8. Eq. 5 need i = 1 underneath the summation 
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9. p. 14. Eq. 16 Under the null, it seems that £^2 should 
approach 1 / *f{p) rather than 0. 

10. Eq. 16 How is the significance of SAM-GS calculated? 

11. p. 18 Eq. 24 and text after, use log in math mode 
rather than log. 

Reply: 

All comments have been addressed and we revised the 
main text correspondingly. 
Thank you for your suggestions and comments. 
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